{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prepare Dataset for Model Training and Evaluating"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Amazon Customer Reviews Dataset\n",
    "\n",
    "https://s3.amazonaws.com/amazon-reviews-pds/readme.html\n",
    "\n",
    "## Schema\n",
    "\n",
    "- `marketplace`: 2-letter country code (in this case all \"US\").\n",
    "- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.\n",
    "- `review_id`: A unique ID for the review.\n",
    "- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.\n",
    "- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.\n",
    "- `product_title`: Title description of the product.\n",
    "- `product_category`: Broad product category that can be used to group reviews (in this case digital videos).\n",
    "- `star_rating`: The review's rating (1 to 5 stars).\n",
    "- `helpful_votes`: Number of helpful votes for the review.\n",
    "- `total_votes`: Number of total votes the review received.\n",
    "- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?\n",
    "- `verified_purchase`: Was the review from a verified purchase?\n",
    "- `review_headline`: The title of the review itself.\n",
    "- `review_body`: The text of the review.\n",
    "- `review_date`: The date the review was written."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "import pandas as pd\n",
    "\n",
    "sess   = sagemaker.Session()\n",
    "bucket = sess.default_bucket()\n",
    "role = sagemaker.get_execution_role()\n",
    "region = boto3.Session().region_name"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download\n",
    "\n",
    "Let's start by retrieving a subset of the Amazon Customer Reviews dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 cp 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz' ./data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "\n",
    "df = pd.read_csv('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', \n",
    "                 delimiter='\\t', \n",
    "                 quoting=csv.QUOTE_NONE,\n",
    "                 compression='gzip')\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "%config InlineBackend.figure_format='retina'\n",
    "\n",
    "df[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='Breakdown by Star Rating')\n",
    "plt.xlabel('Star Rating')\n",
    "plt.ylabel('Review Count')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Balance the Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.utils import resample\n",
    "\n",
    "five_star_df = df.query('star_rating == 5')\n",
    "four_star_df = df.query('star_rating == 4')\n",
    "three_star_df = df.query('star_rating == 3')\n",
    "two_star_df = df.query('star_rating == 2')\n",
    "one_star_df = df.query('star_rating == 1')\n",
    "\n",
    "# Check which sentiment has the least number of samples\n",
    "minority_count = min(five_star_df.shape[0], \n",
    "                     four_star_df.shape[0], \n",
    "                     three_star_df.shape[0], \n",
    "                     two_star_df.shape[0], \n",
    "                     one_star_df.shape[0]) \n",
    "\n",
    "five_star_df = resample(five_star_df,\n",
    "                        replace = False,\n",
    "                        n_samples = minority_count,\n",
    "                        random_state = 27)\n",
    "\n",
    "four_star_df = resample(four_star_df,\n",
    "                        replace = False,\n",
    "                        n_samples = minority_count,\n",
    "                        random_state = 27)\n",
    "\n",
    "three_star_df = resample(three_star_df,\n",
    "                        replace = False,\n",
    "                        n_samples = minority_count,\n",
    "                        random_state = 27)\n",
    "\n",
    "two_star_df = resample(two_star_df,\n",
    "                        replace = False,\n",
    "                        n_samples = minority_count,\n",
    "                        random_state = 27)\n",
    "\n",
    "one_star_df = resample(one_star_df,\n",
    "                        replace = False,\n",
    "                        n_samples = minority_count,\n",
    "                        random_state = 27)\n",
    "\n",
    "df_balanced = pd.concat([five_star_df, four_star_df, three_star_df, two_star_df, one_star_df])\n",
    "df_balanced = df_balanced.reset_index(drop=True)\n",
    "\n",
    "df_balanced.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_balanced[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='Breakdown by Star Rating')\n",
    "plt.xlabel('Star Rating')\n",
    "plt.ylabel('Review Count')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_balanced.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Split the Data into Train, Validation, and Test Sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Split all data into 90% train and 10% holdout\n",
    "df_train, df_holdout = train_test_split(df_balanced, \n",
    "                                        test_size=0.10,\n",
    "                                        stratify=df_balanced['star_rating'])\n",
    "\n",
    "# Split holdout data into 50% validation and 50% test\n",
    "df_validation, df_test = train_test_split(df_holdout,\n",
    "                                          test_size=0.50, \n",
    "                                          stratify=df_holdout['star_rating'])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pie chart, where the slices will be ordered and plotted counter-clockwise:\n",
    "labels = ['Train', 'Validation', 'Test']\n",
    "sizes = [len(df_train.index), len(df_validation.index), len(df_test.index)]\n",
    "explode = (0.1, 0, 0)  \n",
    "\n",
    "fig1, ax1 = plt.subplots()\n",
    "\n",
    "ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', startangle=90)\n",
    "\n",
    "# Equal aspect ratio ensures that pie is drawn as a circle.\n",
    "ax1.axis('equal')  \n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Show 90% Train Data Split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='90% Train Breakdown by Star Rating')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Show 5% Validation Data Split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_validation.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_validation[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='5% Validation Breakdown by Star Rating')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Show 5% Test Data Split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_test.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_test[['star_rating', 'review_id']].groupby('star_rating').count().plot(kind='bar', title='5% Test Breakdown by Star Rating')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Select `star_rating` and `review_body` for Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = df_train[['star_rating', 'review_body']]\n",
    "df_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Convert `star_rating` into label ids (0-4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def to_id(rating):\n",
    "    rating = int(rating)\n",
    "    if rating == 1:\n",
    "        return 0\n",
    "    elif rating == 2:\n",
    "        return 1\n",
    "    elif rating == 3:\n",
    "        return 2\n",
    "    elif rating == 4:\n",
    "        return 3\n",
    "    elif rating == 5:\n",
    "        return 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train['star_rating_id'] = df_train.star_rating.apply(to_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = df_train[['star_rating_id', 'review_body']]\n",
    "df_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Write a CSV With No Header for Comprehend "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "jumpstart_train_path = './data.csv'\n",
    "df_train.to_csv(jumpstart_train_path, index=False, header=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Upload Train Data to S3 for Comprehend"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_s3_prefix = 'data'\n",
    "jumpstart_train_s3_uri = sess.upload_data(path=jumpstart_train_path, key_prefix=train_s3_prefix)\n",
    "jumpstart_train_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls $jumpstart_train_s3_uri"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Store the location of our train data in our notebook server to be used next"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store jumpstart_train_s3_uri"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%javascript\n",
    "Jupyter.notebook.save_checkpoint();\n",
    "Jupyter.notebook.session.delete();"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
